Optimization Techniques

Several Hive query optimization techniques are available to improve performance when running Hive queries. In this article we will learn how to tune Hive queries so that they execute faster on our cluster. The main types of Hive optimization techniques for queries are: the execution engine, usage of a suitable file format, Hive partitioning, bucketing in Apache Hive, vectorization in Hive, cost-based optimization (CBO), and Hive indexing.

Types of Query Optimization Techniques in Hive
  1. Tez-Execution Engine in Hive
  2. Usage of Suitable File Format in Hive (ORC File Format)
  3. Hive Partitioning
  4. Bucketing in Hive
  5. Vectorization In Hive
  6. Cost-Based Optimization in Hive (CBO)
  7. Hive Indexing 

Tez Execution Engine in Hive:- We can increase the performance of our Hive queries by using Tez as the execution engine. Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs (DAGs) of general data-processing tasks. It can be considered a much more flexible and powerful successor to the MapReduce framework. In addition, Tez offers developers an API framework for writing native YARN applications on Hadoop that bridge the spectrum of interactive and batch workloads, allowing data-access applications to work with petabytes of data over thousands of nodes.

Enabling Tez:- Tez can be enabled for the entire cluster at the bootstrap level, or for individual queries at run-time by setting:

set hive.execution.engine=tez;

If administrators configure Tez for the entire cluster, then individual queries can be reverted back to MapReduce by setting the following at the start of the job:

set hive.execution.engine=mr;
Usage of Suitable File Format in Hive:- ORC File Format – If we use an appropriate file format based on the data, it will drastically increase our query performance. For increasing query performance, the ORC file format is usually best suited. ORC stands for Optimized Row Columnar, which implies it can store data in a more optimized way than the other file formats. To be more specific, ORC reduces the size of the original data by up to 75%, and hence data processing speed also increases. Compared to the Text, Sequence, and RC file formats, ORC shows better performance. Internally, it stores row data in groups called stripes, along with a file footer. Therefore, we can say that the ORC format improves performance when Hive is processing the data.
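As a minimal sketch (the table and column names here are hypothetical), a table can be stored in ORC at creation time, and data from an existing text-format table can be copied into it:

```sql
-- Create a new table stored in the ORC format
CREATE TABLE employees_orc (
  emp_id INT,
  emp_name STRING,
  emp_salary FLOAT
)
STORED AS ORC;

-- Copy data from an existing (e.g. text-format) table into the ORC table;
-- Hive rewrites the rows into ORC stripes as it inserts them
INSERT OVERWRITE TABLE employees_orc
SELECT emp_id, emp_name, emp_salary FROM employees;
```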

Hive Partitioning:- Without partitioning, Hive reads all the data in the directory and then applies the query filters to it. This is slow as well as expensive, since all data has to be read. Users also frequently need to filter the data on specific column values, so to apply partitioning in Hive they need to understand the domain of the data they are analyzing. With partitioning, the entries for the various values of the partition columns are segregated and stored in their respective partitions. Hence, when we write a query to fetch values from the table, only the required partitions of the table are queried, which reduces the time taken by the query to yield its result.
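A hedged sketch of partitioning (table and column names are hypothetical): if the employee table is partitioned by country, a query that filters on country only scans the matching partition's directory:

```sql
-- Each distinct country value gets its own directory under the table location
CREATE TABLE employees_part (
  emp_id INT,
  emp_name STRING,
  emp_salary FLOAT
)
PARTITIONED BY (country STRING);

-- With dynamic partitioning, rows are routed to partitions automatically
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employees_part PARTITION (country)
SELECT emp_id, emp_name, emp_salary, country FROM employees;

-- Only the country='US' partition is read here; the others are pruned
SELECT emp_id, emp_salary FROM employees_part WHERE country = 'US';
```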

Bucketing in Hive:- Let's suppose a scenario where a huge dataset is available, but even after partitioning on a particular field or fields, the partitioned file size doesn't match the expectation and remains huge. We still want to manage the partition results in smaller parts. To solve this issue, Hive offers the concept of bucketing, which allows the user to divide table data sets into more manageable parts. Through bucketing, the user can also control the number of those manageable parts (buckets).
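As a hedged sketch (names are hypothetical), bucketing is declared with CLUSTERED BY and a fixed bucket count; rows are assigned to buckets by hashing the bucketing column:

```sql
-- Within each partition, rows are hashed on emp_id into 4 bucket files
CREATE TABLE employees_bucketed (
  emp_id INT,
  emp_name STRING,
  emp_salary FLOAT
)
PARTITIONED BY (country STRING)
CLUSTERED BY (emp_id) INTO 4 BUCKETS;

-- In older Hive versions this setting is needed so that inserts
-- actually honor the declared bucketing
SET hive.enforce.bucketing = true;
```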

Vectorization in Hive:- Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time, which involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1,024 rows at a time; within the block, each column is stored as a vector (an array of a primitive data type). This feature was introduced in Hive 0.13. It significantly improves query execution time and is easily enabled with two parameter settings:

 Enable Hive vectorization:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

When vectorization is enabled, Hive examines the query and the data to determine whether vectorization can be supported. If it cannot be supported, Hive will execute the query with vectorization turned off.

Supported Functionality

The current implementation supports only single-table, read-only queries; DDL and DML queries are not supported. The supported operators are selection, filter, and group by. Partitioned tables are supported.
  
Cost-Based Optimization in Hive:- Before submitting a query for final execution, Hive optimizes its logical and physical execution plan. Until recently, these optimizations were not based on the cost of the query. CBO, a recent addition to Hive, performs further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.

To use CBO, set the following parameters at the beginning of your query:

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;


Then, prepare the data for CBO by running Hive’s “analyze” command to collect various statistics on the tables for which we want to use CBO. 
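As a hedged sketch (the table name is hypothetical), the statistics CBO relies on can be collected like this:

```sql
-- Gather table- and partition-level statistics (row counts, sizes)
ANALYZE TABLE employees COMPUTE STATISTICS;

-- Gather column-level statistics, which CBO uses to estimate
-- selectivity and to choose join orders and join types
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;
```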

Difference between Sort By and Order By

Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column types: if the column is of numeric type, the sort order is numeric; if the column is of string type, the sort order is lexicographical. SORT BY orders the data at each of the 'N' reducers, but each reducer can have overlapping ranges of data.

SELECT emp_id, emp_salary FROM employees SORT BY emp_salary DESC;

Reducer-1
emp_id | emp_salary
10             5000
16             3000
13             2600
19             1800

Reducer-2
emp_id | emp_salary
11             4000
17             3100
14             2500
20             2000


As we can see, the values in each reducer's output are ordered, but total ordering is missing, since we end up with one output per reducer. In Hive, ORDER BY guarantees total ordering of the data, but to do so everything has to be passed to a single reducer, which is normally performance-intensive. Therefore, in strict mode, Hive makes it compulsory to use LIMIT with ORDER BY so that the reducer doesn't get overburdened.
Ordering : Total Ordered data.
Outcome : Single output i.e. fully ordered.
For example :

SELECT emp_id, emp_salary FROM employees ORDER BY emp_salary DESC;

emp_id | emp_salary
10             5000
11             4000
17             3100
16             3000
13             2600
14             2500
20             2000
19             1800


Distribute By:- It ensures each of the N reducers gets non-overlapping ranges of the column, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.


We are distributing by x the following 5 rows to 2 reducers:
x1
x2
x4
x3
x1


Reducer 1
x1
x2
x1
Reducer 2
x4
x3

Cluster By:- CLUSTER BY x ensures each of the N reducers gets non-overlapping ranges of x, then sorts by those ranges at the reducers.
Ordering : Global ordering between multiple reducers.
Outcome : N or more sorted files with non-overlapping ranges.

For the same example as above, if we use CLUSTER BY x, the two reducers will further sort the rows on x:


Reducer 1
x1
x1
x2
Reducer 2
x3
x4
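As a hedged sketch (the table name t is hypothetical), CLUSTER BY on a column is equivalent to combining DISTRIBUTE BY and SORT BY on that same column:

```sql
-- These two queries give the same distribution across reducers
-- and the same per-reducer ordering
SELECT * FROM t CLUSTER BY x;
SELECT * FROM t DISTRIBUTE BY x SORT BY x;
```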

 
How to load XML Data into Hive table?

 <CATALOG>
<BOOK>
<TITLE>Hadoop Definitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>


CREATE TABLE book_details(TITLE STRING, AUTHOR STRING,COUNTRY STRING,COMPANY STRING,PRICE FLOAT,YEAR INT)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.TITLE"="/BOOK/TITLE/text()",
"column.xpath.AUTHOR"="/BOOK/AUTHOR/text()",
"column.xpath.COUNTRY"="/BOOK/COUNTRY/text()",
"column.xpath.COMPANY"="/BOOK/COMPANY/text()",
"column.xpath.PRICE"="/BOOK/PRICE/text()",
"column.xpath.YEAR"="/BOOK/YEAR/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");


Before creating the table, add the XML SerDe jar to the session:
ADD JAR path/hivexmlserde-1.0.0.0.jar;
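Once the table exists, the XML file can be loaded and queried; each <BOOK> element becomes one row. This is a hedged sketch and the input file path is hypothetical:

```sql
-- Load the XML file from the local filesystem (path is an example)
LOAD DATA LOCAL INPATH '/tmp/books.xml' INTO TABLE book_details;

-- The SerDe maps each XPath expression to its column
SELECT title, author, price FROM book_details;
```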

